So far we've seen the magic of learning in
• Neurons, organisms, algorithms, and materials

• But what is it that makes ML so special and powerful?
• And have we seen examples of “mathematics that learns” before?
• Concerns, Limitations and Open Problems in ML


What Makes ML So Special? 


• Although there is no single answer, and it may vary from one ML method to another, some broader aspects include...


The Power of Universal Approximation...

• A NN with 1 hidden layer can represent:
  - any bounded continuous function (to arbitrary ε)
    • Universal Approximation Theorem [Cybenko 1989]
  - any Boolean function (exactly)


[Figure: f(x) with an annotation marking “a small term”]


The Power of Universal Approximation... 


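Theorem (Cybenko)

Let $\sigma$ be any continuous discriminatory function. Then finite sums of the form

$$G(x) = \sum_{j=1}^{N} a_j \, \sigma(w_j^{T} x + b_j), \quad \text{where } w_j \in \mathbb{R}^n,\ a_j, b_j \in \mathbb{R}$$

are dense in $C(I_n)$, where $I_n = [0, 1]^n$ is the unit hypercube.

In other words, given any $\varepsilon > 0$ and $f \in C(I_n)$, there is a sum $G(x)$ of the above form such that

$$|G(x) - f(x)| < \varepsilon, \quad \forall x \in I_n.$$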
But Why?

ES691 - Mathematics for Machine Learning / Dr. Naveed R. Butt @ GIKI - FES

We all know about piecewise linear approximations.

[Figure: plot of an example function f(x)]

...we could split any continuous function into several regions and approximate its value in each region by the height of a rectangle, and still get arbitrarily close to the function values (as long as we have a sufficient number of regions)!
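The rectangle argument can be checked numerically. What follows is a minimal sketch, not a trained network: it hand-builds a one-hidden-layer sum of steep sigmoids in which each region is a "rectangle" formed by one step up and one step down. The target function, the names `bump_network` and `target`, and the `steepness` value are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    # Clip the argument so that np.exp never overflows for very steep units.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def bump_network(f, n_regions, steepness=1000.0):
    """Hand-built one-hidden-layer sigmoid network approximating f on [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_regions + 1)
    # Height of each rectangle: the value of f at the middle of its region.
    heights = f((edges[:-1] + edges[1:]) / 2.0)

    def G(x):
        x = np.asarray(x, dtype=float)[..., None]
        step_up = sigmoid(steepness * (x - edges[:-1]))    # switch on at the left edge
        step_down = sigmoid(steepness * (x - edges[1:]))   # switch off at the right edge
        return np.sum(heights * (step_up - step_down), axis=-1)

    return G

def target(x):
    # Hypothetical smooth target, used only for this illustration.
    return 0.5 + 0.4 * np.sin(2.0 * np.pi * x)

xs = np.linspace(0.05, 0.95, 500)
err_10 = float(np.max(np.abs(bump_network(target, 10)(xs) - target(xs))))
err_100 = float(np.max(np.abs(bump_network(target, 100)(xs) - target(xs))))
print(f"max error with 10 regions:  {err_10:.4f}")
print(f"max error with 100 regions: {err_100:.4f}")
```

With more regions the maximum error on the sampled interval shrinks, which is the intuition behind the theorem; the sketch says nothing about how such weights would be found by training.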
[Figure: output from the top hidden neuron, a sigmoid with w = 80 and b = -40, which steps up at x = -b/w = 0.50]

[Figure: weighted output from the hidden layer; average deviation: 1.88]

neuralnetworksanddeeplearning.com (a MUST VISIT interactive page by Michael Nielsen to see all this in action).


• But Universal Approximation is only part of the story...
• Why? Because we have so many other universal approximators...
  • E.g., Polynomials, Power Series, Fourier Series...


...other “Universal” Approximators 


$$P_n(x) = a_n x^n + a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \cdots + a_2 x^2 + a_1 x + a_0$$

[Figure: sin(x) and an approximation f(x), plotted from -2π to 2π]

Weierstrass Approximation Theorem
• If f(x) is a continuous real-valued function on [a, b], then for any ε > 0 there exists a polynomial P_n on [a, b] such that

$$|f(x) - P_n(x)| < \varepsilon$$

for all x ∈ [a, b].
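The theorem can be illustrated numerically. The sketch below is only an illustration, assuming sin(x) on [-2π, 2π] as the target and using a least-squares fit from NumPy, which is not the best-possible (minimax) polynomial the theorem refers to. The degrees tried are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Sample the target function on the interval [a, b] = [-2*pi, 2*pi].
x = np.linspace(-2.0 * np.pi, 2.0 * np.pi, 400)
y = np.sin(x)

def max_fit_error(degree):
    """Largest deviation, on the sample grid, of a least-squares polynomial fit."""
    p = Polynomial.fit(x, y, degree)
    return float(np.max(np.abs(p(x) - y)))

errors = {degree: max_fit_error(degree) for degree in (3, 7, 15)}
for degree, err in errors.items():
    print(f"degree {degree:2d}: max |sin(x) - P_n(x)| on the grid = {err:.2e}")
```

The error is measured only at the sampled points, so this is evidence for the trend rather than a proof of a uniform bound.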


Taylor's theorem: Let $k \geq 1$ be an integer and let the function $f : \mathbb{R} \to \mathbb{R}$ be $k$ times differentiable at the point $a \in \mathbb{R}$. Then there exists a function $h_k : \mathbb{R} \to \mathbb{R}$ such that

$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(k)}(a)}{k!}(x - a)^k + h_k(x)(x - a)^k,$$

and

$$\lim_{x \to a} h_k(x) = 0.$$
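As a small worked case of Taylor's theorem, the sketch below builds the degree-k Taylor polynomial of sin around a point a and measures its error at x = 1. The choice of sin, the expansion point a = 0, and the function name `taylor_sin` are assumptions made for illustration.

```python
import math

def taylor_sin(x, k, a=0.0):
    """Degree-k Taylor polynomial of sin around the point a, evaluated at x."""
    # The derivatives of sin at a cycle through sin, cos, -sin, -cos.
    derivs = [math.sin(a), math.cos(a), -math.sin(a), -math.cos(a)]
    return sum(
        derivs[n % 4] * (x - a) ** n / math.factorial(n)
        for n in range(k + 1)
    )

degrees = (1, 3, 5, 9)
errors = [abs(taylor_sin(1.0, k) - math.sin(1.0)) for k in degrees]
for k, err in zip(degrees, errors):
    print(f"k = {k}: |sin(1) - T_k(1)| = {err:.2e}")
```

The error falls quickly as k grows, but only near the expansion point; far from a, many more terms are needed.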
Fourier series, sine-cosine form

$$s_N(x) = A_0 + \sum_{n=1}^{N} \left( A_n \cos\left(2\pi \tfrac{n}{P} x\right) + B_n \sin\left(2\pi \tfrac{n}{P} x\right) \right)$$

Inverse transform

$$f(x) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{i 2\pi \xi x}\, d\xi, \quad \forall x \in \mathbb{R}. \quad \text{(Eq.2)}$$

Eq.2 is a representation of f(x) as a weighted summation of complex exponential functions.

Fourier transform

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-i 2\pi \xi x}\, dx. \quad \text{(Eq.1)}$$
In fact, approximation (the ability to memorize and fit available data) alone is not sufficient.

• The power of “generalization” (interpolation, extrapolation, fitting to new data, learning of a general class) is critical, and modern ML seems to do exceptionally well at this.


The Power of OVERPARAMETRIZATION... (and how it may help “Generalization”)

Occam’s Razor

Pluralitas non est ponenda sine necessitate
• “Plurality should not be posited without necessity.”

In Mathematical Modeling Language: “Model parameters should not be increased without necessity”.

But...
- Overparameterization seems to help Neural Networks
- Why? Open question, with several theories...


Hypothesis: Overparameterization Leads to Sparsity

Overparameterization lets the learning algorithm choose the best options (parameters) among data-dependent couplings and essentially “discard” the rest (leading to sparsity).

Lottery Ticket Hypothesis

Random initializations lead to some initializations that are fastest to train towards optima (recall Ant Colony, random directions taken at first).

[Figure: the original “dense” network (left) and its “pruned” subnet (right)]
Both give very similar performance if the subnet is initialized with the same weights that the original network was initialized with when it successfully learned.
Overparameterization and Random Initializations Help Weight-Update Algorithm
- Gradient Descent has been shown to provably optimize overparametrized NNs.

Algorithmic Stability Hypothesis
- Algorithms that do not vary too much with slight changes in training datasets generalize better (possibly indicating that they have learned underlying distributions rather than just the data).




Manifold Hypothesis 


- Many high-dimensional data sets that occur in the real world 
actually lie along low-dimensional latent manifolds inside that 
high-dimensional space. 

- As a consequence, many data sets that initially appear to require many variables to describe can actually be described by a comparatively small number of variables.

- Machine learning models only have to fit relatively simple, low-dimensional, highly structured subspaces within their potential input space (latent manifolds).

- Within one of these manifolds, it is always possible to interpolate 
between two inputs, that is to say, morph one into another via a 
continuous path along which all points fall on the manifold. 


[Fig. 1. Example of high-dimensional data lying in low-dimensional subspaces. It is seen that rather than being uniformly distributed in the 3-dimensional space, these data points lie on the union of two lines and one plane.]

[Fig. 2. Data is often embedded in (lies on) a lower-dimensional structure or manifold. It should be possible to characterize the data and the relationship between individual points using fewer dimensions, if we are able to measure distances on the manifold itself instead of in Euclidean space.]
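The simplest case of the manifold hypothesis, data in 3-D space that actually lies on a plane, can be checked with a singular value decomposition. This is a minimal sketch with synthetic data; the embedding matrix, the sample size, and the random seed are assumptions, and a flat plane is only the linear special case (curved manifolds need nonlinear methods).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden (latent) variables per sample ...
latent = rng.normal(size=(500, 2))
# ... embedded linearly into 3-D space (hypothetical embedding matrix).
embed = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, -1.0]])
data = latent @ embed                     # shape (500, 3), lies on a plane

# Singular values of the centered data measure the spread along each direction.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
print("fraction of variance per direction:", np.round(explained, 6))
```

Two directions carry essentially all of the variance, so two variables suffice to describe data that nominally has three coordinates.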
• Of course, we could get ourselves in trouble (e.g., computationally) with heavy overparameterization!!
• Next, let’s look at some other aspects that help in this regard.


The Power of LOTS of (REPEATED) SIMPLE UNITS with SIMPLE RULES (with localized “gating” of information...)

[Figure: a single neuron combining its inputs with a bias in three steps: 1. weigh, 2. sum up, 3. activate]
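The three steps (weigh, sum up, activate) fit in a few lines. This is a minimal sketch of one unit; the sigmoid activation and the example inputs, weights, and bias are assumptions chosen for illustration.

```python
import numpy as np

def neuron(x, w, b):
    """One simple unit: weigh the inputs, sum them up with a bias, then activate."""
    weighted = w * x                        # 1. weigh
    total = np.sum(weighted) + b            # 2. sum up (plus bias)
    return 1.0 / (1.0 + np.exp(-total))     # 3. activate (sigmoid, an assumed choice)

# Hypothetical inputs, weights, and bias.
out = neuron(np.array([0.5, -1.0, 2.0]), np.array([0.4, 0.6, 0.8]), b=0.1)
print(f"neuron output: {out:.4f}")
```

The activation acts as the local "gate": it decides how much of the summed signal is passed on to the next unit.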
Advantage of Having More Parameters (e.g., Weights) to Play With... (without exploding computations...)

[Figure: two panels, each labeled f(z)]
The Power of Bases (e.g., Features) and Their Weighted Combinations...

[Figure: diagram with the labels Top-down term, CRF, Max-margin Model, Objective, Over-segmentation, Bottom-up term]

Q. Can we write functions as weighted sums of sinusoids (features)?

Ingredient (sinusoid frequency)    Amount (scaling)
f_1                                1
f_2                                0.5

Add all

[Figure: a signal in time and its scaling per frequency, linked by the Fourier transform and the inverse Fourier transform]

If we pick the right basis/features, the problem could become very sparse!
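The "ingredients and amounts" picture can be reproduced with a discrete Fourier transform. In this sketch the amounts 1 and 0.5 follow the slide, while the two frequencies (4 Hz and 9 Hz) and the sampling rate are assumptions picked so that the sinusoids fit the window exactly.

```python
import numpy as np

fs = 64                          # samples per second (assumed)
t = np.arange(fs) / fs           # one second of samples

# Two ingredients (sinusoid frequencies) with amounts 1 and 0.5.
signal = 1.0 * np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 9 * t)

# In the sinusoid basis, the same signal is described by very few numbers.
spectrum = np.fft.rfft(signal)
# Amplitude scaling, valid for every bin except DC and Nyquist (both zero here).
amplitudes = 2.0 * np.abs(spectrum) / len(signal)

significant = np.flatnonzero(amplitudes > 1e-6)
print("non-zero frequencies (Hz):", significant.tolist())
print("their amounts:", np.round(amplitudes[significant], 6).tolist())
```

All 64 time samples are non-trivial, yet only two coefficients are non-zero in the frequency basis: the problem has become sparse.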
Now to Our Second Question...

• In our learning of mathematics, have we seen “mathematics that learns” before?


Roots of a Polynomial...

Analytical (Direct Solution)

Solve $x^3 + 4x + 8 = -2x^2$ analytically (symbolically).

$$x^3 + 2x^2 + 4x + 8 = 0 \quad \text{(standard form)}$$
$$x^2(x + 2) + 4(x + 2) = 0 \quad \text{(factor by grouping)}$$
$$(x^2 + 4)(x + 2) = 0$$
$$x^2 + 4 = 0 \ \text{ or } \ x + 2 = 0 \quad \text{(zero product property)}$$
$$x^2 = -4 \ \text{ or } \ x = -2$$
$$x = \pm 2i \ \text{ or } \ x = -2$$

So if we can factor into linear and quadratic factors, we can find the exact values of all real and complex roots.


Vs.

Algorithmic (Iterative/“Learning” Solution)

Root Approximation: Bisection

Calculate a and b

Procedure bisection method (a, b, ε)
    c ← (a + b) / 2
    While |a − b| > ε and f(c) ≠ 0 do
        If f(a) × f(c) < 0 then    (a negative value indicates that a and c are on opposite sides of the root)
            b ← c
        Else
            a ← c
        c ← (a + b) / 2
    Return a or b or c

Which one is easier to teach a machine? (hint: the one with simple repeated steps).
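The bisection procedure translates almost line for line into code. The sketch below applies it to the polynomial solved analytically earlier, x³ + 2x² + 4x + 8, whose only real root is x = -2; the starting bracket [-5, 0], the tolerance, and the iteration cap are assumptions.

```python
def f(x):
    # Polynomial from the analytical example; its only real root is x = -2.
    return x**3 + 2 * x**2 + 4 * x + 8

def bisection(f, a, b, eps=1e-10, max_iter=200):
    """Halve the bracket [a, b] until it is shorter than eps or a root is hit."""
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    c = (a + b) / 2
    for _ in range(max_iter):
        if abs(a - b) <= eps or f(c) == 0:
            break
        if f(a) * f(c) < 0:
            b = c      # the root lies between a and c
        else:
            a = c      # the root lies between c and b
        c = (a + b) / 2
    return c

root = bisection(f, -5.0, 0.0)
print(f"approximate real root: {root:.8f}")
```

Only the real root is found; the complex pair ±2i from the analytical solution is out of reach for this method.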
Roots of a Polynomial...

A smarter way would be to use knowledge of the function's local behavior (e.g., gradient) to plan the next step.

[Figure: Root Approximation: Bisection vs. Root Approximation: Newton Method, shown side by side on the same function]
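Newton's method is the standard way to use the local slope when looking for a root. The sketch below is an illustration on the polynomial from the analytical example, x³ + 2x² + 4x + 8, rather than on the function shown in the figure; the starting point, tolerance, and iteration cap are assumptions.

```python
def f(x):
    return x**3 + 2 * x**2 + 4 * x + 8

def df(x):
    # Derivative of f; it is positive everywhere, so dividing by it is safe here.
    return 3 * x**2 + 4 * x + 4

def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Follow the tangent line at x to the point where it crosses zero."""
    x = x0
    for iteration in range(1, max_iter + 1):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x, iteration
    return x, max_iter

root, iterations = newton(f, df, x0=1.0)
print(f"root ~ {root:.10f} after {iterations} iterations")
```

On this example Newton's method needs far fewer iterations than bisection for a comparable tolerance, at the price of needing the derivative and a reasonable starting point.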
Finding Optima... Analytical (Direct Solution)

[Cartoon: “I'm sorry. I'm afraid his condition is critical.” © 2008 Courtney Gibbons]

$$f(x) = x \sin(x^2) + 1$$

A = (−2, 2.51), f'(−2) = −5.99

Find the derivative, and solve for x:

$$f'(x) = \sin(x^2) + 2x^2 \cos(x^2) = 0$$

$$f(x, y) = -x^4 + 4(x^2 - y^2) - 3$$

Find the gradient:

$$\nabla f = \begin{bmatrix} \dfrac{\partial}{\partial x}\left(-x^4 + 4(x^2 - y^2) - 3\right) \\[2ex] \dfrac{\partial}{\partial y}\left(-x^4 + 4(x^2 - y^2) - 3\right) \end{bmatrix} = \begin{bmatrix} -4x^3 + 8x \\ -8y \end{bmatrix}$$

Solve the system of equations for x and y.

Imagine having to do so for a function of thousands or millions of variables!
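For the two-variable example, the system can still be solved by hand. The worked steps below are a sketch of that calculation; the second-derivative (Hessian) classification at the end is an added step, not something shown on the slide.

```latex
\begin{aligned}
-4x^3 + 8x = 0 \;&\Rightarrow\; -4x\,(x^2 - 2) = 0 \;\Rightarrow\; x \in \{0,\ \sqrt{2},\ -\sqrt{2}\},\\
-8y = 0 \;&\Rightarrow\; y = 0,\\[1ex]
\text{critical points: } &(0, 0),\ (\sqrt{2}, 0),\ (-\sqrt{2}, 0),\\[1ex]
f(0, 0) = -3, \qquad & f(\pm\sqrt{2}, 0) = -4 + 8 - 3 = 1,\\[1ex]
H(x, y) &= \begin{bmatrix} -12x^2 + 8 & 0 \\ 0 & -8 \end{bmatrix},\\
H(0, 0) = \begin{bmatrix} 8 & 0 \\ 0 & -8 \end{bmatrix} \ \text{(saddle)}, \qquad &
H(\pm\sqrt{2}, 0) = \begin{bmatrix} -16 & 0 \\ 0 & -8 \end{bmatrix} \ \text{(local maxima)}.
\end{aligned}
```

Even here the hand calculation takes several steps, which is the point of the remark about thousands or millions of variables.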
Finding Optima... Algorithmic (Iterative/“Learning” Solution)

Gradient Descent

[Figure: loss (J) plotted against weight (W), marking the initial weight (w_old), the learning rate (α), the new weight (w_new), and the minimum point of the cost function]

$$w_{new} = w_{old} - \alpha \frac{dJ}{dw}$$

Algorithm 2: Gradient Descent
input : f : ℝⁿ → ℝ, a differentiable function
        x^(0), an initial solution
output: x*, a local minimum of the cost function f.

begin
    k ← 0;
    while STOP-CRIT and (k < k_max) do
        x^(k+1) ← x^(k) − α^(k) ∇f(x^(k));
        with α^(k) = arg min_{α ∈ ℝ₊} f(x^(k) − α ∇f(x^(k)));
        k ← k + 1
    return x^(k)

Also approximated numerically!
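The update rule can be run directly. The sketch below makes two assumptions that differ from Algorithm 2: it uses a fixed learning rate instead of the line search for α, and it minimizes the cost J = −f, where f(x, y) = −x⁴ + 4(x² − y²) − 3 is the function from the analytical example, so the minima of J are the maxima of f. The starting point and tolerances are also illustrative.

```python
import numpy as np

def grad_J(w):
    """Gradient of J(x, y) = x**4 - 4*x**2 + 4*y**2 + 3, i.e. of -f."""
    x, y = w
    return np.array([4.0 * x**3 - 8.0 * x, 8.0 * y])

def gradient_descent(grad, w0, alpha=0.05, tol=1e-8, k_max=10_000):
    """Repeat w_new = w_old - alpha * dJ/dw until the step becomes tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(k_max):
        step = alpha * grad(w)
        w = w - step
        if np.linalg.norm(step) < tol:   # stopping criterion (assumed form)
            break
    return w

w_star = gradient_descent(grad_J, [1.0, 1.0])
print("approximate minimizer of J:", np.round(w_star, 6))
```

Starting from (1, 1), the iterates settle near (√2, 0), one of the critical points the gradient equations give for this function, without ever solving those equations.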
Finding Optima... Algorithmic (Iterative/“Learning” Solution)

We could try multiple initializations to avoid local minima.
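Trying several initializations is easy to sketch in one dimension. The example below runs plain gradient descent on f(x) = x sin(x²) + 1, the function from the analytical slide, from three starting points and keeps the best result. The starting points, the learning rate, and the stopping rule are assumptions.

```python
import math

def f(x):
    return x * math.sin(x**2) + 1.0

def df(x):
    return math.sin(x**2) + 2.0 * x**2 * math.cos(x**2)

def descend(x0, alpha=0.005, tol=1e-10, k_max=20_000):
    """Fixed-step gradient descent from a single starting point."""
    x = x0
    for _ in range(k_max):
        step = alpha * df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

starts = [-2.6, -1.0, 1.8]                 # hypothetical initializations
minima = [descend(x0) for x0 in starts]    # each run ends in its nearest valley
best_x = min(minima, key=f)                # keep the lowest valley found

for x0, x_min in zip(starts, minima):
    print(f"start {x0:+.1f} -> local minimum at x = {x_min:+.4f}, f = {f(x_min):+.4f}")
print(f"best of all runs: x = {best_x:+.4f}")
```

Each start lands in a different local minimum, and only the comparison across runs reveals the lowest one; with more starts the chance of missing a deeper valley shrinks but never vanishes.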
Guess Who...? “Learning” Probabilities

BAYES RULE:

$$P(A \mid B) = P(A) \times \frac{P(B \mid A)}{P(B)}$$

• P(A | B): learned/updated probability of the event given new data/information
• P(A): initial probability of the event (assumed or based on past data)
• P(B | A) / P(B): statistics of the new data and its statistical relevance to the event of interest

[Figure: Prior Probability → Bayes Theorem, applied repeatedly: iterative “learning”]

Illustration: Guessing the Pet

Problem Statement: Is the pet in the box a cat or a dog?
Initial Belief: 50/50 chances, so P(Cat) = 0.5 and P(Dog) = 0.5

We have a CLUE: the pet is very quiet
• P(Quiet | Cat) = 80% or 0.8
• P(Quiet | Dog) = 30% or 0.3

From Bayes Theorem, the probability that the pet is a cat given that it is quiet is:

$$P(\text{Cat} \mid \text{Quiet}) = \frac{P(\text{Quiet} \mid \text{Cat}) \times P(\text{Cat})}{P(\text{Quiet})}$$
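The pet example can be finished numerically, with P(Quiet) obtained from the law of total probability. The sketch below also repeats the update to show the iterative "learning" idea; the assumption that we get three separate, independent "quiet" observations is ours, as is the function name `bayes_update`.

```python
def bayes_update(prior_cat, p_quiet_given_cat=0.8, p_quiet_given_dog=0.3):
    """Posterior P(Cat | Quiet) from the prior P(Cat) and the two likelihoods."""
    prior_dog = 1.0 - prior_cat
    # Law of total probability: P(Quiet) = P(Q|Cat)P(Cat) + P(Q|Dog)P(Dog).
    p_quiet = p_quiet_given_cat * prior_cat + p_quiet_given_dog * prior_dog
    return p_quiet_given_cat * prior_cat / p_quiet

belief = 0.5                  # initial 50/50 belief
history = [belief]
for _ in range(3):            # assumed: three independent "quiet" observations
    belief = bayes_update(belief)   # yesterday's posterior is today's prior
    history.append(belief)

print("P(Cat) after each quiet observation:", [round(p, 4) for p in history])
```

After the first clue the belief in "cat" rises from 0.5 to 0.4 / 0.55 ≈ 0.727, and each further quiet observation pushes it higher.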


The Good, the Bad, and the Ugly...

ML Concerns, Limitations, and Open Problems

• Data Acquisition
• High Susceptibility to Errors & Biases
• Heavy Reliance on Data Quality
• Concerns of Data Privacy
• Investment of Time & Resources
• Ethical Concerns
• Difficulties in Interpretation
1. They Can be an Overkill...

[Cartoon: “Tell me the truth. I'm... I'm ready to hear it.” “You don’t need machine learning for that.”]
2. Data Hungry, Hardware Hungry, Power Hungry...

[News headline: “Feeding the AI beast: Power-hungry AI is putting the hurt on global electricity supply. Data centers are becoming a bottleneck for AI development.” Camilla Hodgson, Financial Times, 4/17/2024]

Even on a small organizational scale, reliable training requires a good deal of reliable data.


3. Do Not Capture Causal Relations...

ML generally works on statistical relations (CORRELATION).

[Cartoons: “Stop eating ice-cream!” and “Don’t post your summer travel plans on Facebook!”]

But causal relations (CAUSATION) are often needed to truly understand and work on problems.
4. Bias, Reproducibility, and Verifiability...

Hard to identify ethical biases in training data and in results provided by a big ML algorithm.

Example: Biased “unchallengeable” decisions could worsen economic disparity.

[Cartoon: “AI Model: No loan for you.” “Me: Why?”]
[Cartoon: “We're using AI instead of biased ...” “What did you train the AI on?”]

[Paper: “A Survey on Bias and Fairness in Machine Learning”, Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan, USC-ISI]

[News headline, Reuters: “Insight - Amazon scraps secret AI recruiting tool that showed bias against women”]

- ML was trained to find “good CVs” using CVs of highly successful people in Silicon Valley (who are mostly men, for reasons other than competence).
- It started rejecting CVs with “feminine” language.
4. Bias, Reproducibility, and Verifiability...

Hard to verify and reproduce operation of large models.

Why?
• Facilitates collaboration and review processes
• Ensures continuity of work and knowledge exchange
• Provides opportunity for future evaluations
• Verification of results for hidden biases can help address them

[News feature, Nature, 05 December 2023: “Is AI leading to a reproducibility crisis in science? Scientists worry that ill-informed use of artificial intelligence is driving a deluge of unreliable or useless research.”]
5. Black Box - Explainability Issues...

ML learned “features” are not always aligned with what we consider features.

[Figure: Training Data → Learning Process → Learned Function → Output (“It's an apple!”, “This is a cat (p = .93)”) → User with a Task, who asks:]
• Why did you do that?
• Why not something else?
• When do you succeed?
• When do you fail?
• When can I trust you?
• How do I correct an error?
6. Vulnerabilities

Deep ML, in particular, is prone to adversarial attacks and errors in data.

[Figure: “panda” (57.7% confidence) + .007 × noise = “gibbon” (99.3% confidence)]

[Figure: Adversary/Attacker; Training Set → Training Model → Tests → Wrong Output]

“To err is human, but to really foul things up you need a computer.”
- Paul Ehrlich

7. [Some] Open Problems

• Why does machine learning (particularly, deep learning) work so well?
  • We saw some recent hypotheses. Can you add more or prove...
• How to make ML reproducible, verifiable, and bias-free?
• How to make ML more explainable?
  • New field: Explainable AI (XAI)
• How to reduce vulnerabilities and defend against attacks?

Questions?? Thoughts??